Classifying the CIFAR-10 Dataset with Transfer Learning (and TensorFlow Keras)

“Research is what I’m doing when I don’t know what I’m doing.” — Wernher von Braun

Abstract

In this blog post, I will share my journey of developing a Python script that utilizes transfer learning to train a Convolutional Neural Network (CNN) to classify the CIFAR-10 dataset. The goal was to achieve a validation accuracy of 87% or higher, using only TensorFlow’s Keras API and one of the pre-trained models from Keras Applications. The journey was filled with trials, errors, laughter, and a lot of learning. It involved setting up a Docker environment, experimenting with different models, data augmentation, fine-tuning, and, of course, tweaking hyperparameters. Although I am admittedly a noob, the results surprised me (the first training went very well for such a simple model)!

Introduction

The goal was to train a model that could accurately classify images from the CIFAR-10 dataset. However, the challenge was not just to build a model, but to build it in a specific way: using only TensorFlow’s Keras API, using one of the pre-trained models from Keras Applications, and saving the trained model in the current working directory as ‘cifar10.h5’. Some additional technical rules were: the saved model should be compiled, the script should not execute training when the file is imported, and the model should have a validation accuracy of 87% or higher.

Understanding the CIFAR-10 Dataset

The CIFAR-10 dataset is a well-known dataset in the machine learning community. It is often used as a benchmark for image classification algorithms. The dataset consists of 60,000 32x32 color images, divided into 10 classes, with 6,000 images per class. The classes are mutually exclusive and cover common objects and animals: airplanes, automobiles, birds, cats, deer, dogs, frogs, horses, ships, and trucks.

The images in the CIFAR-10 dataset are low-resolution (32x32 pixels), which makes the classification task challenging. The images are also diverse and varied, which means that a good model needs to be able to recognize a wide range of features and patterns.

Materials and Methods

Model Selection Process

The first step in the model selection process was to decide which pre-trained model to choose from Keras Applications. This was a great tool to have available, as it allowed me to leverage the features that the model has already learned from a large image dataset like ImageNet. This approach, known as transfer learning, is particularly effective when the new task is similar to the task that the pre-trained model was trained on.

I started with the idea of training several models at different ‘compute loads’. That is, I wanted a very simple network that would train very quickly, a ‘mid-tier’ net that used all the tools but with a shallow network and few epochs, and a very robust, deep net with a high epoch count; all nets had to at least pass the mandatory 87% accuracy mark. After finalizing the plan, I chose to start by building the simple network, which is where the decision to use the pre-trained MobileNetV2 model from Keras Applications as the base model came from. MobileNetV2 is the smallest model by megabyte count, with the fewest parameters of all the models in Keras Applications. Thus, ‘lilTinyNet’ was born. The ‘mid-tier’ was going to be one of the EfficientNet models, and the ‘large-tier’ was supposed to be ResNet50, VGG16, or InceptionResNetV2. Things rarely go to plan, of course. What ended up happening is that the first run of ‘lilTinyNet’ went so well that, after trying 5 other pre-trained models (all over 87% accuracy, but none near ‘lilTinyNet’), I decided instead to use as many tools as I could to expand ‘lilTinyNet’ while changing things like the optimizer and other hyperparameters (epoch count, batch size, learning rate schedule). Therefore, all models and code covered here are the result of transfer learning from MobileNetV2.

Setting up the Environment (skip if you don’t want to read my gpu rant xD)

The first step in this journey was setting up the environment. I started with Google Colab, but soon realized that I wanted to unleash the full power of my own personal GPU for some deep learning fun. Besides, who doesn’t love a good experiment? So I decided to ditch the cloud and go local. But don’t worry: if you still want to use that sweet, free Colab GPU, I have a small guide below on how to use it for training the model.

I was also curious about how my RTX 2060 super would compare to the free GPU that Google Colab offers. I mean, it’s free, but is it fast? And how much would it cost me in terms of wear-and-tear on my precious GPU? I’ve heard horror stories of GPUs dying after intensive training sessions, but I’ve also put my GPU through some serious gaming and overclocking tests, and it always came out alive and kicking. So I figured it could handle some deep learning as well, as long as I kept an eye on the load and temperature.

But before I risked frying my GPU, I did what any sensible person would do: I asked Reddit. And lo and behold, the wise and anonymous redditors agreed with me that it should be fine, as long as I was careful and monitored the situation. So, seeing as they agreed with me, I took their word as gospel and immediately put my GPU back to work! (Disclaimer: don’t sue me if your GPU explodes or something. But please inform me about it, because that’s hilarious... I didn’t know they can actually explode. Do your own research and/or be prepared for the consequences!)

A tip: I have nice fans on my GPU, so I cranked them up to 100% and lowered the power consumption cap to 85–90% in my graphics card management software. This kept my GPU at a cool maximum of 72 degrees during training! Note: most of my training sessions were very short (average 15m) or cut off short.

Also, to be honest, I had never used Docker before this project and saw it as an opportunity to learn something new. Plus, it was actually necessary, because WSL2 Ubuntu doesn’t play nice with my GPU without Docker. That being said, I’m a total Docker noob!

Quick Docker set-up (For those interested in training the model using their own GPU in WSL2 Ubuntu)

Now, let’s dive into setting up Docker. Docker is a platform that allows you to automate the deployment, scaling, and management of applications using containerization. It’s a great tool to ensure that your application runs the same, regardless of the environment.

To use Docker, you first need to install it. On Windows, you can download Docker Desktop from the official Docker website. After installation, you’ll need to enable the WSL 2 backend. This can be done from the Docker Desktop settings.

Once Docker is set up, you can run a Docker container with GPU support using the following command in your terminal (I run mine inside VS Code). Put simply, this opens the Docker container within the directory your terminal is currently in:

docker run -it --gpus all -v $(pwd):/scripts tensorflow/tensorflow:latest-gpu bash

This command does a few things:

  • docker run starts a new Docker container.

  • -it ensures that you're running Docker interactively (i.e., it will provide a terminal interface).

  • --gpus all allows the Docker container to access your GPU.

  • -v $(pwd):/scripts mounts the current directory ($(pwd)) to the '/scripts' directory in the Docker container. This means that all the files in your current directory will be accessible in the '/scripts' directory in the Docker container.

  • tensorflow/tensorflow:latest-gpu is the Docker image that the container is based on. This image comes with TensorFlow pre-installed and is configured to use a GPU.

  • bash starts a bash shell inside the Docker container.

This command is quite handy as it not only runs a Docker container with GPU support but also mounts the current directory to the ‘/scripts’ directory in the container. This means that any files saved in the Docker container will actually be saved in the directory you were in when you started the Docker session!

Also, look up ‘Dockerfile’ when you get a chance.

Running the Code in Google Colab

Google Colab is a free cloud service that provides a coding environment for AI researchers. It comes with GPU support and is a great tool for anyone who wants to experiment with machine learning and deep learning without setting up their own environment.

Look in the file browser on the left for a Google Drive logo. Click it to mount your Drive so your model will be saved to your Google account’s Drive.
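
Alternatively, you can mount your Drive from a code cell with the standard Colab snippet:

from google.colab import drive

# Mounts your Google Drive at /content/drive inside the Colab session
drive.mount('/content/drive')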

To run your code in Google Colab, follow these steps:

  1. Go to the Google Colab website and sign in with your Google account.

  2. Click on ‘File’ -> ‘New notebook’ to create a new notebook.

  3. You can now write your code in the cells. You can add new cells by clicking on ‘+ Code’ or ‘+ Text’ for code and text cells respectively.

  4. To run a cell, click on the play button on the left side of the cell or press ‘Shift+Enter’.

  5. To use a GPU, click on ‘Runtime’ -> ‘Change runtime type’, select ‘GPU’ under ‘Hardware accelerator’, and then click on ‘Save’.

Remember to save your work regularly. Google Colab notebooks are saved to your Google Drive.

In conclusion, whether you choose to use Docker or Google Colab largely depends on your specific needs and resources. Docker allows you to utilize your own GPU and provides a consistent environment, while Google Colab is a hassle-free option that comes with a free GPU.

Preprocessing the Data

The next step was to preprocess the data for the model. The function preprocess_data(X, Y) was written to preprocess the CIFAR-10 data and labels. The function uses the preprocess_input function from Keras Applications (the MobileNetV2 version) to normalize the pixel values of the images and the to_categorical function from Keras utils to convert the labels into one-hot encoded vectors. As stated earlier, CIFAR-10 is a dataset of 60,000 32x32 color images in 10 different classes, with 6,000 images per class.

How to preprocess the data?

The data preprocessing consists of two main steps:

  1. Normalizing the pixel values of the images. This means scaling the values from 0 to 255 to a range between -1 and 1. This helps the model learn faster and more accurately, as it reduces the variance of the input data. The preprocess_input function from keras.applications.mobilenet_v2 does this normalization for us, as it applies exactly the scaling that MobileNetV2 expects.

  2. Converting the labels into one-hot encoded vectors. This means transforming the labels from integers (0 to 9) to binary arrays of length 10, where only one element is 1 and the rest are 0. For example, the label 3 (cat) would be converted to [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]. This helps the model output probabilities for each class, as it can use a softmax activation function at the last layer. The to_categorical function from Keras utils does this conversion for us, as it takes the number of classes as an argument.

The code for the preprocessing function is:

def preprocess_data(X, Y):
    """
    Pre-processes the data for your model.

    X is a numpy.ndarray of shape (m, 32, 32, 3) containing the CIFAR 10 data,
      where m is the number of data points.
    Y is a numpy.ndarray of shape (m,) containing the CIFAR 10 labels for X.

    Returns: X_p, Y_p.
    X_p is a numpy.ndarray containing the preprocessed X.
    Y_p is a numpy.ndarray containing the preprocessed Y.
    """
    X_p = K.applications.mobilenet_v2.preprocess_input(X)
    Y_p = K.utils.to_categorical(Y, 10)
    return X_p, Y_p
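
As a quick usage sketch (the variable names are my own), loading and preprocessing the dataset might look like this, with K being tensorflow.keras as in the rest of this post:

import tensorflow.keras as K

# Load CIFAR-10, then normalize the images and one-hot encode the labels
(x_train, y_train), (x_test, y_test) = K.datasets.cifar10.load_data()
x_train, y_train = preprocess_data(x_train, y_train)
x_test, y_test = preprocess_data(x_test, y_test)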

Transfer Learning: What, When, and How?

Transfer learning is a machine learning technique where a pre-trained model is used on a new problem. It’s called transfer learning because the knowledge from the pre-trained model is transferred to the new problem.

What to transfer: In our case, we want to transfer the knowledge embedded in the convolutional layers of a pre-existing, pre-trained model, such as MobileNetV2, EfficientNet, InceptionV3, VGG16, ResNet50, or others available in Keras Applications. These models have been trained on large image datasets like ImageNet, and their convolutional layers have learned a robust set of features from over a million images spanning a thousand classes. These features range from simple edges and textures to more complex ones like object parts. These learned features are generally transferable to other image recognition tasks, including our CIFAR-10 classification problem.

When to transfer: In our scenario we will be working with the CIFAR-10 dataset which, while varied and well-labelled, is comparatively small and lacks the diversity found in large-scale datasets like ImageNet. Training a deep learning model from scratch on a smaller dataset may lead to overfitting, meaning the model may not generalize well to unseen data. Additionally, training deep learning models from scratch requires significant computational resources and time. So, by using a pre-trained model, we can leverage the features it has learned and save on training time and resources, making transfer learning an attractive choice in our context.

How to transfer: Here’s how we plan to transfer the knowledge from the pre-existing model to our task:

  1. Select a pre-trained model: As mentioned, we can use a model available in Keras Applications.

  2. Preprocess the CIFAR-10 dataset: Since CIFAR-10 images are smaller than what the pre-existing models are trained on, we’ll need to resize the images. Also, we’ll need to normalize the pixel values and one-hot encode the labels.

  3. Freeze the convolutional base of the pre-trained model: This involves setting the trainable attribute of the model layers to False to preserve the weights and biases.

  4. Add a new classifier on top of the pre-trained model: We’ll add a few layers that will be trained on our specific task. These layers should end with a dense layer with 10 units (one for each class in CIFAR-10) with a softmax activation function to output class probabilities.

  5. Compile and train the model: Compile the model with an appropriate optimizer and loss function, and then train it on the CIFAR-10 data.

  6. Fine-tune the model (optional): Once the top layers are well-trained, we could unfreeze a few layers in the pre-trained model and train it further with a very low learning rate to fine-tune the model to our specific task.

It’s important to keep track of the model’s performance (accuracy) on a validation set during training to ensure the model is learning well and not overfitting. That is, if the training accuracy keeps going up while the validation loss rises and the validation accuracy drops, the model is ‘overfitting’: it is getting too accustomed to trends in the training data, so its ability to recognize images in the ‘real world’ is not as high as the training accuracy makes it seem.
This is a good place to note that one can set early stopping to make sure that the model will stop training after enough consecutive epochs without improvement in validation loss. This is especially important if one needs to grab a drink and a snack, followed by a ‘short video’, followed by a nap, followed by forgetting what you were doing…. This is not an exact, personal example. Anyway, in our case, the code for early stopping looks like:

early_stopping = K.callbacks.EarlyStopping(patience=3, restore_best_weights=True)

This makes the model quit training after 3 consecutive epochs without improvement in validation loss, restoring the best weights seen so far.

Also, checkpoint saving was implemented in the code, so whether or not training runs to completion, the saved model will be the one from the epoch with the lowest validation loss (along with its associated accuracy). This code looks like:

model_checkpoint = K.callbacks.ModelCheckpoint('cifar10.h5', save_best_only=True)

Building the Models

The Birth of ‘lilTinyNet’

The first model I built, which I affectionately named ‘lilTinyNet’, was based on the MobileNetV2 model from Keras Applications. The model was compiled with the RMSprop optimizer with a learning rate of 0.001, the categorical cross entropy loss function, and the accuracy metric. The model was trained for up to 10 epochs with a batch size of 32, but the training was stopped early if the validation loss did not improve for 3 consecutive epochs. The best model weights were saved using a model checkpoint callback.

Model Architecture

‘lilTinyNet’ was built using the Keras library with TensorFlow as the backend. The model architecture was based on the MobileNetV2 architecture, which was pre-trained on the ImageNet dataset. The top layers of the MobileNetV2 model were replaced with custom dense layers to adapt the model to the CIFAR-10 classification task.

The model structure was defined as follows:

  1. A lambda layer was used to resize images from 32x32 to 128x128, one of the input sizes MobileNetV2 was trained on.

  2. The base model (MobileNetV2) was added.

  3. The output of the base model was flattened to 1 dimension.

  4. A dense layer with 1024 units and ReLU activation was added.

  5. Dropout was applied to prevent overfitting.

  6. A final dense layer with 10 units (for the 10 classes) was added with softmax activation to output probabilities for the classes.
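
Putting that list together, a minimal sketch of how ‘lilTinyNet’ might be assembled looks something like the following (the dropout rate is an assumption on my part; everything else follows the description above, with tensorflow.keras imported as K):

def build_lilTinyNet():
    """Sketch of the 'lilTinyNet' architecture described above."""
    # Pre-trained MobileNetV2 convolutional base, frozen as a feature extractor
    base_model = K.applications.MobileNetV2(include_top=False,
                                            weights='imagenet',
                                            input_shape=(128, 128, 3))
    base_model.trainable = False

    model = K.Sequential([
        # Resize the 32x32 CIFAR-10 images up to 128x128 for MobileNetV2
        K.layers.Lambda(lambda x: K.backend.resize_images(
            x, 4, 4, data_format='channels_last', interpolation='bilinear'),
            input_shape=(32, 32, 3)),
        base_model,
        K.layers.Flatten(),
        K.layers.Dense(1024, activation='relu'),
        K.layers.Dropout(0.3),  # dropout rate is an assumption
        K.layers.Dense(10, activation='softmax')
    ])

    model.compile(optimizer=K.optimizers.RMSprop(learning_rate=0.001),
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model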

Model Training

The model was trained with early stopping and model checkpointing. Early stopping was used to prevent overfitting by stopping the training process when the validation performance stopped improving. Model checkpointing was used to save the model weights at the end of each epoch if the model’s performance on the validation set had improved.

A learning rate scheduler callback was also used. This callback function adjusted the learning rate according to a schedule. Specifically, the learning rate was reduced by an order of magnitude (a factor of 10) every 5 epochs. This helped to achieve better convergence of the model. Oddly enough, I had planned to switch from this to learning rate decay, but the results with the learning rate scheduler were better, though that is surely mostly coder error :) The code for the learning rate scheduler is as follows:

class LearningRateScheduler(K.callbacks.Callback):
    """Learning rate scheduler callback"""
    def on_epoch_end(self, epoch, logs=None):
        if (epoch+1) % 5 == 0:
            lr = K.backend.get_value(self.model.optimizer.lr)
            K.backend.set_value(self.model.optimizer.lr, lr * 0.1)
            print(" ...Adjusted learning rate to:", lr*0.1)

The model was trained for 10 epochs with a batch size of 32. The training and validation data were the preprocessed CIFAR-10 data.
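
For reference, the training call looks roughly like this (a sketch under the assumptions above; variable names such as x_train and x_test are mine, while early_stopping, model_checkpoint, and LearningRateScheduler are the callbacks defined earlier):

history = model.fit(x_train, y_train,
                    batch_size=32,
                    epochs=10,
                    validation_data=(x_test, y_test),
                    callbacks=[early_stopping, model_checkpoint,
                               LearningRateScheduler()])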

Results and Insights

‘lilTinyNet’ reached its peak at the 7th epoch with 92.17% accuracy and 0.2936 validation loss in 11m53s of training! This was the first model that I ever really trained, so I had no idea how fast that was before I beat my head against the wall for days trying to beat that accuracy with 5 ‘new and improved’ models. That is, until….

‘lilTinyNet’ evolved into a ‘megaNet’!

After much trial and error, the final form was ascended to, dubbed ‘megaNet’, and it is indeed a much more sophisticated version of ‘lilTinyNet’. The development of ‘megaNet’ involved incorporating data augmentation techniques, fine-tuning the base model, adding batch normalization to the layers stacked on top of the pre-trained model, increasing the batch size from 32 to 64, and switching to the Adam optimizer.

Data Augmentation

One of the key enhancements in ‘megaNet’ was the use of data augmentation techniques. Data augmentation is a strategy that enables practitioners to significantly increase the diversity of data available for training models, without actually collecting new data. This is particularly useful when dealing with image data, where the acquisition of new data can be costly and time-consuming.

In ‘megaNet’, I used Keras’s ImageDataGenerator to perform data augmentation. This included random rotations, width and height shifts, horizontal flips, zooming, and brightness adjustments. These transformations introduced variability into the training set, helping the model generalize better to unseen data. The code looks like this:

datagen = K.preprocessing.image.ImageDataGenerator(
    featurewise_center=False,  # Set input mean to 0 over the dataset
    featurewise_std_normalization=False,  # Divide inputs by std of the dataset
    rotation_range=10,  # Degree range for random rotations
    width_shift_range=0.1,  # Range for random horizontal shifts
    height_shift_range=0.1,  # Range for random vertical shifts
    horizontal_flip=True,  # Randomly flip inputs horizontally
    zoom_range=0.2,  # Range for random zoom
    brightness_range=[0.8, 1.2])  # Range for picking a brightness shift value
datagen.fit(x_train)
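
With the generator set up, training on augmented batches is then a matter of passing datagen.flow to model.fit, roughly like this (a sketch; the epoch count here is illustrative):

history = model.fit(datagen.flow(x_train, y_train, batch_size=64),
                    epochs=24,
                    validation_data=(x_test, y_test),
                    callbacks=[early_stopping, model_checkpoint])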

Fine-tuning the Base Model

Another significant enhancement in ‘megaNet’ was the fine-tuning of the base model. While in ‘lilTinyNet’ the base model was frozen and used as a feature extractor, in ‘megaNet’ I decided to unfreeze the last 20 layers of the base model and train them on the CIFAR-10 data. In theory, this allowed the model to adapt better to the specific features of the CIFAR-10 dataset.

Fine-tuning was performed after the initial training of the top layers. The learning rate was reduced to 0.0001 to prevent large updates that could destroy the pre-learned features. The fine-tuned model was trained for up to 24 epochs, with early stopping if the validation loss did not improve for 5 consecutive epochs.
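
A rough sketch of that fine-tuning step might look like the following, assuming base_model refers to the MobileNetV2 instance inside the already-trained model (the variable names and the choice of Adam here are my assumptions; the layer count, learning rate, epoch limit, and patience follow the text):

# Unfreeze only the last 20 layers of the MobileNetV2 base
base_model.trainable = True
for layer in base_model.layers[:-20]:
    layer.trainable = False

# Recompile with a much lower learning rate so the pre-learned
# features are nudged rather than overwritten
model.compile(optimizer=K.optimizers.Adam(learning_rate=0.0001),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Continue training with more patience before early stopping kicks in
fine_tune_stopping = K.callbacks.EarlyStopping(patience=5, restore_best_weights=True)
model.fit(datagen.flow(x_train, y_train, batch_size=64),
          epochs=24,
          validation_data=(x_test, y_test),
          callbacks=[fine_tune_stopping, model_checkpoint])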

Adam Optimization

In ‘megaNet’, I decided to switch from RMSprop to Adam optimization. Adam is an optimization algorithm that can handle sparse gradients on noisy problems. It’s known for combining the best properties of the AdaGrad and RMSProp algorithms to provide an optimization algorithm that can handle a wide range of data and parameter scales.

Why the switch, you ask? Well, it was mostly a fun shot in the dark! I wanted to see if Adam could provide any improvements over RMSprop. And as it turns out, ‘megaNet’ seemed to enjoy the change!

Learning Rate Scheduler

Just like in ‘lilTinyNet’, ‘megaNet’ also incorporated a learning rate scheduler. This helped to achieve better convergence of the model. The learning rate was reduced by a factor of 10 every 5 epochs, just as it was in ‘lilTinyNet’.

Results and Insights

Before fine-tuning, the ‘megaNet’ model achieved a training accuracy of 94.81% with a loss of 0.1804! After fine-tuning, the accuracy slightly improved to 94.99% with a loss of 0.1777. The improvement from fine-tuning was much smaller than expected, which suggests the fine-tuning process might not have been set up properly. However, learning was reasonably steady throughout the training process. It’s possible that I could have let the model run for even more epochs for a slight increase. I did not log the exact time of this training (it was probably around 40m), but it wouldn’t be proper to compare the two anyway, as ‘megaNet’ had many more epochs in both the new-layer training and the fine-tuning.

The journey of developing ‘megaNet’ was filled with learning and experimentation. I chased every rabbit hole that I could to try and improve ‘lilTinyNet’, most of which led to negative or stagnating changes. Through this experimentation I feel I was much better able to grasp how deep learning actually works and why they say hyperparameter tuning is a constant concern. I was very happy to finally get a positive result from data augmentation, and even though the fine-tuning barely resulted in any improvement, I was just happy that it kept improving with training!

The ‘megaNet’ model, with its more sophisticated architecture and advanced techniques, represents a significant improvement over ‘lilTinyNet’ in both architecture and performance. Even if the gain was only 2.82 percentage points (92.17% to 94.99%), it sure seemed like a lot after training larger models for longer with worse results!

Discussion

The journey of building these models was an enlightening experience filled with learning and experimentation. I delved into the intricacies of data preprocessing, discovered the power of transfer learning, and explored the impact of different hyperparameter tweaks. I also got to experience firsthand the utility of callbacks in Keras.

One of the key lessons I learned was the trade-off between model complexity and training time. The ‘lilTinyNet’ model, despite its simplicity and faster training time, achieved nearly the same accuracy as the more complex ‘megaNet’ model. This highlighted the importance of model selection and optimization in machine learning.

A significant challenge I faced was resizing the images from 32x32 to the input size that the pre-trained models were trained on. I used a lambda layer with the resize_images function from the Keras backend, but I am still exploring if this is the best approach.

Another challenge was choosing the appropriate base model for transfer learning. I chose MobileNetV2 because it is lightweight and efficient, but I am still uncertain if it is the best choice for the CIFAR-10 dataset. My early experiments with other models didn’t yield great results, but I am eager to continue experimenting.

Interestingly, I discovered that it’s possible to continue training an already trained and compiled model. This was a revelation to me, and it opened up new possibilities for model improvement. I initially ran ResNet50 for 2 epochs, but switched to MobileNetV2 because it was faster and had higher accuracy.

However, MobileNetV2 was overfitting, as evidenced by the decreasing validation accuracy and large validation loss. To combat this, I added dropout layers and adjusted the learning rate. I started with a learning rate of 0.0001, thinking that a lower rate would be beneficial since I was using a pretrained model with already established weights.

I used the Adam optimization algorithm because of its efficiency and because it requires little memory. Adam also adjusts the learning rate adaptively, which can lead to better results.

Intriguingly, I found that I could experiment with a very high learning rate on MobileNetV2. This led me to learn about ‘learning rate scheduling’, a technique that adjusts the learning rate during training. After implementing learning rate scheduling, dropout, early stopping, and model checkpointing, I readjusted the learning rate of MobileNetV2 by two orders of magnitude.

I ran the model for 10 epochs, but the model was saved on the 7th epoch with the lowest validation loss. The entire process took 11.88 minutes, demonstrating the efficiency of MobileNetV2.

In conclusion, this project was a valuable learning experience. It taught me the importance of model selection, hyperparameter tuning, and various techniques to combat overfitting. It also showed me that there’s always room for experimentation and improvement in machine learning.

Acknowledgments

I would like to thank the creators of the CIFAR-10 dataset and the developers of TensorFlow and Keras for providing the tools and resources necessary for this project. It has been great as a newcomer to learn that there are so many resources which can be utilized reasonably easily by anyone, anywhere.

I would also like to thank my GPU for its hard work and resilience throughout this journey. Ape, AI, and silicon together strong.


Appendices

You made it this far and for that you get code! *the crowd goes wild*

Code: holbertonschool-machine_learning/supervised_learning/transfer_learning (github.com/spindouken/holbertonschool-machine_learning)

(megaNet is 0-transfer, and 0-main will evaluate the trained megaNet models)
